ProMiner: rule-based protein and gene entity recognition

Author: BMC Bioinformatics
Daniel Hanisch
Heinz-Theodor Mevissen
Juliane Fluck
Katrin Fundel
Ralf Zimmer
Publication venue: BioMed Central
Publication date: 01/01/2005

doi:10.1186/1471-2105-6-S1-S14. Supplement: A critical assessment of text mining methods in molecular biology (Report; editors: Christian Blaschke, Lynette Hirschman, Alfonso Valencia, Alexander Yeh).
Background: Identification of gene and protein names in biomedical text is a challenging task as the corresponding nomenclature has evolved over time. This has led to multiple synonyms for individual genes and proteins, as well as names that may be ambiguous with other gene names or with general English words. The Gene List Task of the BioCreAtIvE challenge evaluation enables comparison of systems addressing the problem of protein and gene name identification on common benchmark data. Methods: The ProMiner system uses a pre-processed synonym dictionary to identify potential name occurrences in the biomedical text and associate protein and gene database identifiers with the detected matches. It follows a rule-based approach and its search algorithm is geared towards recognition of multi-word names [1]. To account for the large number of ambiguous synonyms in the considered organisms, the system has been extended to use specific variants of the detection procedure for highly ambiguous and case-sensitive synonyms. Based on all detected synonyms fo

CiteSeerX

Crossref

Fraunhofer-ePrints

Gene and protein nomenclature in public databases

Author: AA Morgan
AS Schwartz
D Hanisch
E Adar
E Brill
H Liu
H Yu
JT Chang
K Fundel
Katrin Fundel
L Chen
L Hirschman
M Szugat
M Weeber
O Tuason
Ralf Zimmer
T Ono
V Hatzivassiloglou
Y Tsuruoka
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text-mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require knowledge of all names in use for a given gene or protein. Various organism-specific or general public databases aim at organizing knowledge about genes and proteins. These databases can be used for deriving gene and protein name dictionaries. So far, little is known about the differences between databases in terms of size, ambiguities and overlap. RESULTS: We compiled five gene and protein name dictionaries for each of the five model organisms (yeast, fly, mouse, rat, and human) from different organism-specific and general public databases. We analyzed the degree of ambiguity of gene and protein names within and between dictionaries, as well as with respect to a lexicon of common English words and domain-related non-gene terms, and we compared different data sources in terms of the size of the extracted dictionaries and the overlap of synonyms between them. The study shows that the number of genes/proteins and synonyms covered in individual databases varies significantly for a given organism, and that the degree of ambiguity of synonyms varies significantly between different organisms. Furthermore, it shows that, despite considerable efforts of co-curation, the overlap of synonyms in different data sources is rather moderate and that the degree of ambiguity of gene names with common English words and domain-related non-gene terms varies depending on the considered organism. CONCLUSION: These results indicate that combining the data contained in different databases allows the generation of gene and protein name dictionaries that contain significantly more of the names in use than dictionaries obtained from individual data sources. Furthermore, curation of combined dictionaries considerably increases size and decreases ambiguity. The entries of the curated synonym dictionary are available for manual querying, editing, and PubMed- or Google-search via the ProThesaurus-wiki. For automated querying via custom software, we offer a web service and an exemplary client application.
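The kind of dictionary comparison described in this abstract (ambiguity within a dictionary, clashes with common English words, overlap between sources) can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the toy identifiers and synonyms, the case-folding step, and the use of Jaccard overlap are all assumptions made for illustration.

```python
from collections import defaultdict

def ambiguous_synonyms(dictionary):
    """Return synonyms (lowercased) that map to more than one identifier."""
    owners = defaultdict(set)
    for gene_id, synonyms in dictionary.items():
        for syn in synonyms:
            owners[syn.lower()].add(gene_id)
    return {syn: ids for syn, ids in owners.items() if len(ids) > 1}

def all_synonyms(dictionary):
    """Flatten a dictionary into its set of lowercased synonyms."""
    return {syn.lower() for syns in dictionary.values() for syn in syns}

def synonym_overlap(dict_a, dict_b):
    """Jaccard overlap of two synonym sets (an assumed overlap measure)."""
    a, b = all_synonyms(dict_a), all_synonyms(dict_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def common_word_clashes(dictionary, lexicon):
    """Synonyms that are also entries of a common-word lexicon."""
    return all_synonyms(dictionary) & {word.lower() for word in lexicon}

# Toy data: identifiers and synonym assignments are hypothetical.
db_a = {"G1": ["cdc2", "CDK1"], "G2": ["white", "w"], "G3": ["CDK1"]}
db_b = {"G1": ["cdc2", "p34"], "G2": ["white"]}

print(ambiguous_synonyms(db_a))                     # 'cdk1' is shared by G1 and G3
print(synonym_overlap(db_a, db_b))                  # 2 shared of 5 distinct synonyms
print(common_word_clashes(db_a, ["white", "for"]))  # {'white'}
```

Lowercasing before comparison is a design choice that inflates ambiguity for case-sensitive nomenclatures, so a real analysis may want to compare both with and without it.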

Crossref

Directory of Open Access Journals

Public Library of Science (PLOS)

Phenocopy – A Strategy to Qualify Chemical Compounds during Hit-to-Lead and/or Lead Optimization

Author: Baum Patrick
Baur Martin
Eils Roland
Fundel-Clemens Katrin
Gantner Florian
Gruenbaum Lore
Heckel Armin
Ittrich Carina
Kontermann Roland E.
Mara Lisa
Mennerich Detlev
Park John E.
Quast Karsten
Roth Gerald J.
Rust Werner
Schmid Ramona
Schnapp Andreas
Siewert Susanne
Weith Andreas
Publication venue: Public Library of Science
Publication date: 01/01/2010

A phenocopy is defined as an environmentally induced phenotype of one individual which is identical to the genotype-determined phenotype of another individual. The phenocopy phenomenon has been translated to the drug discovery process, as phenotypes produced by the treatment of biological systems with new chemical entities (NCEs) may resemble environmentally induced phenotypic modifications. Various new chemical entities that inhibit the kinase activity of Transforming Growth Factor β Receptor I (TGF-βR1) were qualified by high-throughput RNA expression profiling. This chemical genomics approach yielded precise, time-dependent insight into TGF-β biology and furthermore allowed a comprehensive analysis of each NCE's off-target effects. The evaluation of off-target effects by the phenocopy approach allows a more accurate and integrated view of optimized compounds, supplementing classical biological evaluation parameters such as potency and selectivity. It therefore has the potential to become a novel method for ranking compounds during various drug discovery phases.

CiteSeerX

Directory of Open Access Journals

Overview of BioCreative II gene normalization

Author: Cohen Aaron M
Cohen K Bretonnel
Divoli Anna
Fluck Juliane
Fundel Katrin
Hakenberg Jörg
Hirschman Lynette
Hsu Chun-Nan
Krauthammer Michael
Lau William W
Leaman Robert
Liu Heng-hui
Liu Hongfang
Lu Zhiyong
Morgan Alexander A
Ruch Patrick
Schuemie Martijn
Sun Chengjie
Torres Rafael
Wang Xinglong
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers yielded improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
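The figures quoted for the 'maximum recall' system can be checked with a short worked example, and the idea of a simple voting scheme can be sketched alongside it. This is only an illustration: the abstract does not specify the voting rule, so the `vote` function, its `min_votes` threshold, and the toy system outputs are assumptions. The F-measure is the usual balanced harmonic mean of precision and recall.

```python
from collections import Counter

def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def vote(system_outputs, min_votes):
    """Keep identifiers returned by at least min_votes systems (assumed rule)."""
    counts = Counter(gene_id for output in system_outputs for gene_id in set(output))
    return {gene_id for gene_id, n in counts.items() if n >= min_votes}

# Recall of the pooled 'maximum recall' system, from the counts in the abstract.
recall = 763 / 785
print(round(recall, 2))                   # 0.97

# Its F-measure, using the reported (rounded) precision of 0.23.
print(round(f_measure(0.23, recall), 2))  # 0.37

# Toy outputs of three hypothetical systems for one abstract.
systems = [{"1017", "7157"}, {"1017"}, {"1017", "672"}]
print(vote(systems, min_votes=2))         # {'1017'}
```

The low F-measure of the pooled system shows why pooling alone is not a usable combination strategy: it maximizes recall at the cost of precision, which is the trade-off a voting threshold is meant to control.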

Crossref

Fraunhofer-ePrints

Erasmus University Digital Repository

EUR Research Repository

Archive ouverte UNIGE

Control of S-phase genes in fission yeast

Author: Apostolakis Joannis
Fundel Katrin
Güttler Daniel
Zimmer Ralf
Publication venue: The University of Edinburgh
Publication date: 01/01/1992

Background: Significant parts of biological knowledge are available only as unstructured text in articles of biomedical journals. By automatically identifying gene and gene product (protein) names and mapping these to unique database identifiers, it becomes possible to extract and integrate information from articles and various data sources. We present a simple and efficient approach that identifies gene and protein names in texts and returns database identifiers for matches. It has been evaluated in the recent BioCreAtIvE entity extraction and mention normalization task by an independent jury. Methods: Our approach is based on the use of synonym lists that map the unique database identifiers for each gene/protein to the different synonym names. For yeast and mouse, synonym lists were used as provided by the organizers, who generated them from public model organism databases. The synonym list for fly was generated directly from the corresponding organism database. The lists were then extensively curated in a largely automated procedure and matched against MEDLINE abstracts by exact text matching. Rule-based and support vector machine-based post filters were designed and applied to improve precision. Results: Our procedure showed high recall and precision with F-measures of 0.897 for yeast and 0.764/0.773 for mouse in the BioCreAtIvE assessment (Task 1B) and 0.768 for fly in a post-evaluation. Conclusion: The results were close to the best among all submissions. Depending on the synonym properties, it can be crucial to consider context and to filter out erroneous matches. This is especially important for fly, which has a very challenging nomenclature for the protein name identification task. Here, the support vector machine-based post filter proved to be very effective.
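The pipeline outlined in the Methods (synonym lists mapped to identifiers, exact text matching, then a post filter) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the word-boundary handling, the three filter rules, and the toy synonym list are hypothetical, and the support vector machine-based filter is not covered.

```python
import re

def build_index(synonym_lists):
    """Invert {identifier: [synonyms]} into {synonym: {identifiers}}."""
    index = {}
    for gene_id, synonyms in synonym_lists.items():
        for syn in synonyms:
            index.setdefault(syn, set()).add(gene_id)
    return index

def find_matches(text, index):
    """Exact, case-sensitive matching of each synonym as a whole token."""
    matches = []
    for syn, ids in index.items():
        # Lookarounds stop 'we' from matching inside 'wee1' (assumed rule).
        pattern = r"(?<!\w)" + re.escape(syn) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            matches.append((m.start(), syn, ids))
    return sorted(matches, key=lambda match: (match[0], match[1]))

def post_filter(matches, stop_words, min_length=3):
    """Hypothetical rules: drop short names, common words, ambiguous names."""
    return [
        (pos, syn, ids)
        for pos, syn, ids in matches
        if len(syn) >= min_length
        and syn.lower() not in stop_words
        and len(ids) == 1
    ]

# Toy synonym list; identifiers and assignments are made up for illustration.
synonym_lists = {"ID1": ["cdc2", "p34"], "ID2": ["we", "wee1"]}
text = "Here we show that wee1 phosphorylates cdc2."

raw = find_matches(text, build_index(synonym_lists))
kept = post_filter(raw, stop_words={"we", "for", "not"})
print([syn for _, syn, _ in raw])   # ['we', 'wee1', 'cdc2']
print([syn for _, syn, _ in kept])  # ['wee1', 'cdc2']
```

The example shows why a post filter matters for precision: exact matching alone reports the English word "we" as a gene mention, and only the filtering step removes it.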